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Abstract 

Naturally evolving proteins gradually accumulate mutations while continuing to fold to sta- 
ble structures. This process of neutral evolution is an important mode of genetic change, and 
forms the basis for the molecular clock. We present a mathematical theory that predicts the 
number of accumulated mutations, the index of dispersion, and the distribution of stabilities in 
an evolving protein population from knowledge of the stability effects (AAG values) for single 
mutations. Our theory quantitatively describes how neutral evolution leads to marginally stable 
proteins, and provides formulae for calculating how fluctuations in stability can overdisperse 
the molecular clock. It also shows that the structural influences on the rate of sequence evo- 
lution observed in earlier simulations can be calculated using just the single-mutation AAG 
values. We consider both the case when the product of the population size and mutation rate 
is small and the case when this product is large, and show that in the latter case the proteins 
evolve excess mutational robustness that is manifested by extra stability and an increase in the 
rate of sequence evolution. All our theoretical predictions are confirmed by simulations with 
lattice proteins. Our work provides a mathematical foundation for understanding how protein 
biophysics shapes the process of evolution. 
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INTRODUCTION 

Proteins evolve largely through the slow accumulation of amino acid substitutions. Over evo- 
lutionary time, this process of sequence divergence creates homologous proteins that differ 
at the majority of their residues, yet still fold to similar structures that often perform con- 
served biochemical functions (ILESK and CHOTHIA 19801) . The maintenance of structure and 
function during sequence divergence suggests that much of protein evolution is neutral in 
the sense that observed sequence changes frequently do not alter a protein's ability to fold 
and adequately perform the biochemical function necessary to enable its host organism to 
survive. This comparative evidence for neutrality in protein evolution has been corrobo- 
rated by experimental studies showing that the mutations separating diverged sequences of- 
ten have no effect other than modest and additive changes to stability (ISERRANO et al. 19931) , 
and that a large fraction of random mutations do not detectably alter a protein's structure or 
function (ISHORTLE and LlN 19851 IPAKULA et al. 19861 ILOEB et al. 19891 IGUO et al. 2004t 
IBloom et al. 20051 IB LOOM et al. 20061) . In this respect, it seems that protein evolution should 
be well described by Kimura's neutral theory of evolution, which holds that most genetic 
change is due to the stochastic fixation of neutral mutations (IKlMURA 1983b . One of the 
key predictions of the neutral theory is that assuming a constant mutation rate, the num- 
ber of mutations separating two proteins should be proportional to the time since their di- 
vergence (IKimura 1983b . Indeed, the observation by |Zuckerkandl and Pauling (1965b 



that proteins are "molecular clocks" that accumulate mutations at a roughly constant rate 
has long been taken as one of the strongest pieces of evidence supporting the neutral the- 
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ory (IOhta and Kimura 19711) . 

However, mutations that are neutral with respect to a protein's capacity to perform its bio- 
logical function often affect protein thermodynamics. The biological functions of most proteins 
depend on their ability to fold to thermodynamically stable native structures dANFlNSEN 19731) . 
Yet natural proteins are typically only marginally stable, with free energies of folding (AG f) 
between -5 and -15 kcal/mol (IFERSHT 19991 ). Most random mutations to proteins are destabi- 
lizing (IGODOY-Ruiz et al. 2004t IPakula et al. 1986t IMatthews 19931 IKumar et al. 20061) , 
and their effects on stability (measured as A AG, the AG / of the mutant protein minus the AG f 
of the wildtype protein) are frequently of the same magnitude as a protein's net stability. The 
impact of a mutation on a protein's function can therefore depend on the protein's stability: a 
moderately destabilizing mutation that is easily tolerated by a stable parent protein may com- 
pletely disrupt the folding of a less stable parent. This effect of protein stability on mutational 
tolerance has been verified by experiments demonstrating that more stable protein variants are 
markedly more robust to random mutations ABLOOM et al. 20051 IBLOOM et al. 20061) . 

The fact that mutations that are neutral with respect to direct selection for protein function 
can affect a protein's tolerance to subsequent mutations is not consistent with the simplest for- 
mulation of the neutral theory of evolution, which tends to assume that the fraction of mutations 



that is neutral remains constant in time. |KlMURA (1987[ ) himself recognized the possibility that 



the neutrality might change, and TAKAHATA (1987) mathematically treated the consequences 



of a "fluctuating neutral space". In particular, Takahata showed that fluctuating neutrality could 
explain the observed overdispersion in the molecular clock (|CUTLER 2000bl) (the tendency for 
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the variance in the number of fixed mutations to exceed the expectation for the Poisson process 
predicted by the neutral theory) long considered troublesome for the neutral theory. However, 
further progress on this topic was stymied by the lack of a specific model for how or why 
protein neutrality might fluctuate. 

More recently, researchers have preferred to describe neutral evolution using the con- 
cept of "neutral networks," which are networks in the space of possible protein sequences 
in which each functional protein is linked to all other functional proteins that differ by only a 
single mutation (ISmith 1970UHUYNEN et al. 19961 IGovindarajan and Goldstein 1997t 
Ivan Nimwegen etal. 1999tlB0RNB erg-Bauer and Chan 19991ITiana etal. 20001 IB astolla et al. 20021) . 
A neutrally evolving protein population is then envisioned as moving on the neutral net- 
work, and the neutrality of the population may fluctuate if the nodes on the network differ 
in their connectivities. A general theoretical treatment of evolution on neutral networks by 



VAN Nimwegen et al. (1999] ) has shown that if the product of the population size and muta- 
tion rate is small then members of the population are equally likely to occupy any node, while if 
this product is large then the population will preferentially occupy highly connected nodes (see 
also (IBornberg-Bauer and Chan 1999UTaverna and Goldstein 2002bH"XiA and Levitt 20041) ). 
Simulations with simplified lattice models of proteins have attempted to provide insight into 
the specific features of protein neutral networks. These simulations have shown that lattice 
protein neutral networks are centered around highly connected nodes occupied by stable pro- 
teins (IBornberg-Bauer and Chan 19991 IXia and Levitt 20041 IWingreen et al. 20041 
IBROGLIA et al. 19991 , a finding consistent with the experimental observation (IBLOOM et al. 20051 
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IBloom et al. 20061) that stable proteins are more mutationally robust. Lattice protein stud- 
ies also suggest that protein structures differ in their "designabilities" (defined as the total 
number of sequences that fold into a structure), and that sequences that fold into more des- 
ignable structures will neutrally evolve at a faster rate due to the increased size and con- 
nectivity of their neutral networks (IGOVINDARAJAN and GOLDSTEIN 19971 ILl et al. 19~96t 
England and Shakhnovich 2003HChan and Bornberg-Bauer 20021IWingreen etal. 2004b . 
Finally, simulations have demonstrated that fluctuations in neutrality as a protein population 
moves along its neutral network can lead to an overdispersion of the molecular clock (|BASTOLLA et al. 2002|) . 
as originally suggested by Takahata. However, an extension of these lattice protein simulations 
of evolution on neutral networks into a quantitative theory has been difficult because protein 
neutral networks are far too large to be computed for all but the simplest lattice models. 

Here we present a mathematical treatment of neutral protein evolution that describes the 
evolutionary dynamics in terms of the AAG values for single mutations, which are experimen- 
tally measurable. Our treatment is based on the experimentally verified (IBLOOM et al. 20051 
IBLOOM et al. 20061) connection between protein stability and mutational robustness, as well 
as a few biophysically supported assumptions about AAG values for random mutations. By 
linking a protein's tolerance to mutations with stability, we are able to quantitatively describe 
neutral evolution without a full description of the neutral network. We can then compute the 
average number of accumulated mutations, the average fraction of neutral mutations, the index 
of dispersion, and the distribution of stabilities in a neutrally evolving population solely from 
knowledge of the AAG values for single mutations. In addition, we follow the formalism of 
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VAN NlMWEGEN et al. (1999] ) to calculate all four of these properties in the limit when the 
product of the population size and mutation rate is much less than one and in the limit when 
this product is much greater than one. In demonstrating that these properties are different in 
these two limits, we show that the rate of fixation of neutral mutations can vary with population 
size in violation of one of the standard predictions of Kimura's neutral theory (IKlMURA 19871 ). 
Our work presents a unified view of neutral protein evolution that is grounded in measureable 
thermodynamic quantities. 

MATERIALS AND METHODS 
Lattice Protein Simulations We performed simulations with lattice proteins of L = 20 
monomers of 20 types corresponding to the natural amino acids. The proteins could occupy 
any of the 41,889,578 possible compact or non-compact conformations on a two-dimensional 
lattice. The energy of a conformation C is the sum of the nonbonded nearest-neighbor interac- 

L i~2 

tions, E (C) = Cij (C) x e (Ai, Aj), where Cij (C) is one if residues % and j are nearest 

i=ij=i 

neighbors in conformation C and zero otherwise, and e (Ai, Aj) is the interaction energy be- 
tween residue types Ai and Aj, given by Table 5 of |MlYAZAWA and IERNIGAN (1985[ ). We 



computed the stability of a conformation C t as AG f (C t ) = E (C t )+T\n {Q (T) - exp [-E (C t ) /T]} 
where Q (T) = ^ exp [-E (Cj) /T] is the partition sum, made tractable by noting that there 
are only 910,972 unique contact sets. All simulations were performed at a reduced temperature 
of T = 1.0 

We used adaptive walks to find sequences that folded into each of the three arbitrarily cho- 
sen conformations shown in Fig. [2] with AG/ < 0, and then neutrally evolved these sequences 
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for 10 4 generations with a population size of N = 100. Our evolutionary algorithm was as 
follows: at each generation we randomly chose a protein that folded to the parental structure 
with AG/ < from the population and mutated each residue to some other randomly chosen 
residue with probability 5 x 10 -4 , and continued doing this until we had filled the new pop- 
ulation with proteins. At the end of this equilibration evolution, we chose the most abundant 
sequence in the population as the starting point for further analysis and for the computation 
of the distribution of AAG values for all 380 point mutations (sequences shown in Fig. [2]). 
In principle, computing the distribution of AAG values over all sequences in the population 
rather than just the most abundant one should give a more accurate representation of the true 
form of this distribution, and indeed we found that doing this slightly increased the accuracy 
of the predictions shown in Fig. |2] However, the resulting improvement in accuracy was small, 
since the approximate constancy of the AAG distribution during neutral evolution (discussed 
below) means that the distribution computed over a single sequence is representative of that 
computed over all sequences in the population. Therefore, we chose to compute the AAG 
distribution over just the most abundant sequence since this choice more closely tracks what 
would be experimentally feasible with real proteins. (It is experimentally tractable to compute 
AAG values for a single protein, but would be unmanageable to do so for all proteins in a 
natural population.) 

To collect data for the case when the product of the population size N and the per 
protein per generation mutation rate /i is 1, we first equilibrated 1,000 replicates by evolv- 
ing each of them with a population size of iV = 10 and for 5,000 generations starting with 
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a clonal population of the initial sequence described above. The remainder of the evolution- 
ary algorithm was as described above: the mutation rate stayed at 5 x 10~ 4 per residue per 
generation (corresponding to a per protein per generation mutation rate of /i = 10 -2 ), and at 
each generation all proteins that folded to the target native structure with AG / < reproduced 
with equal probability. We then evolved each of these equilibrated populations for a further 
5,000 generations to collect data. We combined the data for all the folded proteins in the fi- 
nal populations of all the replicates to calculate the average number of mutations {tti)t after T 
generations, the corresponding index of dispersion R T , and the distribution of stabilities shown 
in Fig. [21 If we instead simply randomly chose a single folded protein from the final popu- 
lation of each replicate, we obtained results that were identical within the precision shown in 
Fig. |2l We emphasize that (m) T and R T were computed by keeping track of the actual number 
of mutations that had occurred during the evolutionary history of each protein, not simply by 
counting the number of amino acid differences between the ancestral and final sequences (the 
two quantities may differ if a single site undergoes multiple mutations, as discussed in more 
detail in later sections). 

To generate the data for Nfj, 3> 1, we used the same procedure but with N = 10 5 and only 
performed 10 replicates. We again computed the statistics shown in Fig. |2]by combining the 
data for all of the folded proteins in the final populations of all 10 replicates. Similar results 
were obtained if we instead computed (m)^ and Rt over all of the folded proteins in the final 
population of a single replicate (average values of (m) T were identical while the R T values of 
1.03, 0.95, and 0.94 were extremely similar to those shown from top to bottom in Fig. [2]). This 
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outcome is expected since the probability distributions for N/i 3> 1 evolve deterministically. 

Lattice Protein Predictions The numerical predictions for the lattice proteins given in Fig. [2] 
were computed by constructing the matrix W described in the first section of RESULTS with 
a bin size of b = 0.005 and truncating the matrix by assuming that no proteins would have 
stabilities less than -5.0. For the case when Nfj, <C 1, (m)^ was calculated using Equation[6] 
and R T was calculated using EquationQTJ For Nji ^> 1, (m) T was calculated using Equation 
[T8]and Rt was calculated using Equation [T9l 

RESULTS 

Assumptions and Mathematical Background In this section we describe the physical view 
of protein evolution that motivates our work. We begin with the basic observations that evo- 
lution selects for protein function, and that most proteins must stably fold in order to func- 
tion (|ANFINSEN 19731) . meaning that protein stability is under evolutionary pressure only in- 
sofar as it must be sufficient to allow a protein to fold and function. In taking this view, 
we ignore those proteins (estimated at 10% of prokaryotic and 30% of eukaryotic proteins) 
that are intrinsically disordered (IUVERSKY et al. 20051) , as well as those rare proteins that are 
only kinetically stable (IJASWAL et al. 20021) . Natural selection for function requires a pro- 
tein to fold with some minimal stability AG™ m , since proteins that lack this minimal sta- 
bility will be unable to reliably adopt their native structure and perform their biochemical 
task. A protein's extra stability beyond this minimal threshold is quantified as AG/ Xtra = 
AG/ — AG™ 111 , meaning that all functional proteins must have AG/ Xtra < (more negative val- 
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ues of AGf indicate increased stability). We further assume that as long as AGj xtra < 0, nat- 
ural selection for protein function is indifferent to the exact amount of extra stability a protein 
posesses. This assumption is at odds with the persistent speculation that high stability inher- 
ently impairs protein function and so is selected against by evolution dDEPRlSTO et al. 2005t 
ISOMERO 19951) . But the circular argument most commonly advanced to support this specula- 
tion — that the observed marginal stability of natural proteins indicates that higher stability is 
detrimental to protein function — has now been contradicted both by experiments that have 
dramatically increased protein stability without sacrificing function (ISERRANO et al. 19931 
IGiver et al. 19981; Ivan den Burg et al. 19981 IZhao and Arnold 19991 ) and by demon- 
strations that marginal stability is a simple consequence of the fact that most mutations are 
destabilizing (|TAVERNA and GOLDSTEIN 200211 I ARNOLD et al. 200 1[ as well as the current 
work). There is a possibility, however, that certain regulatory proteins must be marginally 
stable to faciliate rapid degradation (|HUNTZICKER et al. 20061) . To summarize, current bio- 
chemical evidence supports our assumption that (with certain well-defined exceptions) the only 
requirement imposed on protein stability by natural selection for protein function is that stabil- 
ity must meet or surpass some minimal threshold (a protein must have AGy xtra < 0). 

A mutation to a protein changes its stability by an amount AAG, and experimental mea- 
surements of AAG values have shown that most mutations are destabilizing (have AAG > 
0) (IPakula et al. 19861 IGodoy-Ruiz et al. 20041 IMatthews 1993t IKumar et al. 20061) . 
A mutation is neutral with respect to selection for stability if AAG + AGj xtra < since 
the mutant protein still satisfies the minimal stability threshold; otherwise the mutant does 
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not stably fold and is culled by natural selection. Of course, mutations can also have spe- 
cific effects on protein function (such as altering an enzyme's activity), but experiments have 
shown that such mutations are rare compared to the large number of mutations that affect sta- 
bility (IShortle and Lin 19851 IPakula et al. 19861 ILoeb et al. 19891 IBloom et al. 2006D . 
Mutations can also have effects unrelated to the functioning of the individual protein molecule: 
they can affect its propensity to aggregate (ICHITI et al. 20001) . alter its codon usage (|AKASHI 20031) . 
change its mRNA stability (|CHAMARY and HURST 20051) . affect the efficiency or accuracy 
of translation (|Akashi 20031 IROCHA and Danchin 20041) . or change the fraction of mis- 
translated proteins that fold (IDRUMMOND et al. 20051) . These higher-level effects are prob- 
ably most apparent in the evolution of highly expressed proteins (|DRUMMOND et al. 20051 
IPAL et al. 20011) . However, here we ignore such effects and assume that the evolutionary im- 
pact of a mutation is mostly determined by its effect on protein stability (an assumption in 
agreement with a recent bioinformatics analysis by |SANCHEZ et al. (2006| )). The view we 



present therefore describes the impact of a mutation solely by its A AG value and the AG"j. xtra of 
the wildtype protein, and is summarized graphically in Fig. [QWe have previously used a similar 
view to successfully describe experimental protein mutagenesis results (IBLOOM et al. 20051 
IBloom et al. 20061 ). 

[Figure 1 about here.] 

To use the view of Fig. \T\ to construct a useful description of neutral protein evolution, 
we make one major assumption: that the overall distribution of A AG values for random mu- 
tations stays roughly constant as the protein sequence evolves. Actually, this assumption is 
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stronger than is strictly needed for the mathematical theory presented below — the theory can 
be developed simply by assuming that all proteins with the same AG/ have the same dis- 
tribution of AAG values (in this case the matrix elements Wij introduced below depend on 
j in addition to the difference i — j). However, we make the stronger assumption that the 
AAG distribution remains constant during sequence evolution, since we believe that this as- 
sumption is consistent with existing evidence. We emphasize that this assumption does not 
imply that we are arguing that the AAG distribution is identical for every possible protein 
sequence. Clearly, for any given structure there is a most stable sequence (with all AAG 
values positive), a least stable sequence (with all AAG values negative), and a vast range of 
sequences in between. However, most of these sequences fall within a stability range that 
is never populated by evolution, since simulations (|TAVERNA and GOLDSTEIN 2002al) and 
experiments (|Keefe and Szostak 20011 IDavidson et al. 19951) clearly show that the vast 
majority of protein sequences do not stably fold into any structure (meaning the least stable 
folded protein is still far more stable than the typical random sequence). Among the sub- 
set of sequences that do stably fold, the simple statistical reality that marginally stable se- 
quences are far more abundant than highly stable sequences causes evolution to further con- 
fine itself mostly to sequences with stabilities far less than that of the most stable sequence 
(|Taverna and Goldstein 2002al I Arnold etal. 20011 as well as the current work). This 
fact is amply demonstrated by engineering experiments that have greatly increased the stability 
of natural proteins without sacrificing any of their functional properties (ISERRANO et al. 19931 
IGiver et al. 19981 Ivan den Burg et al. 19981 IZhao and Arnold 19991 ). Therefore, al- 
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though the distribution of AAG values certainly varies widely among all sequences, it is still 
reasonable to assume that it is relatively constant among those sequences visited by natural evo- 
lution. This assumption of a constant AAG distribution among evolved sequences is explic- 
itly supported by simulations ABLOOM et al. 20051 [BLOOM et al. 20061 IB ROGLIA et al. 19991 
IWlLKE et al. 20051 as well as the current work), and is consistent with the observation that the 
number of neighbors on a protein's neutral network is approximately determined by its sta- 
bility (IBornberg-Bauer and Chan 1999t IXia and Levitt 20041) . Furthermore, protein 
mutagenesis experiments indicate that the AAG values for random mutations are usually ad- 
ditive (ISERRANO et al. 19931 IWELLS 19901) , meaning that any given mutation to a protein of 
length L will alter only « 1/L of the other AAG values, leaving the AAG distribution mostly 
unchanged. Finally, the assumption of a constant AAG distribution has been shown to ex- 
plain the experimentally observed exponential decline in the fraction of functional proteins 
with increasing numbers of mutations (IBLOOM et al. 20051) . However, we acknowledge that 
at present the assumption of a roughly constant AAG distribution among neutrally evolving 
proteins can be verified only for lattice proteins — for real proteins the most we can say is that 
it is consistent with existing experimental evidence. 

We begin our mathematical treatment by conceptually dividing the continuous variable of 
protein stability into small discrete bins of width b. This discretization of stability allows us to 
treat mutations as moving a protein from one bin to another — the bins can be made arbitrarily 
small to eliminate any numerical effects of the binning. The stability of each folded protein in 
the evolving population (the folded proteins are all those with AG*j. xtra < 0) can be described 
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by specifying its stability bin. Specifically, a protein is in bin i if it has AGj xtra between 

(1 — i) b and — ib, where i — 1, 2, Let be the probability that a random mutation has 

a AAG value such that it moves a protein's stability from bin j to bin i, where i and j both 
are in the range 1,2,.... Then Wij is easily computed as the fraction of AAG values between 
b (j — i — 1) and b (j — i). Since only describes transitions between folded proteins, and 
since we have assumed that a protein's mutational tolerance is determined by its stability, then 
the fraction of folded mutants (neutrality) of a protein in bin j is Uj = W^. Clearly, more 

i 

stable proteins will have larger values of Vj. 

In the next two sections, we will use the matrix W with elements to calculate the 
distribution of stabilities in an evolving protein population of constant size N, the mean num- 
ber of mutations (m)r after T generations, the corresponding index of dispersion Rt = 

- — '-, — r — — , and the average fraction of mutations (u) that do not destabilize the proteins 

\m)T 

past the minimal stability threshold. We assume that W is computed from the distribution of 
AAG values for all random single amino-acid mutations, although in principle it could be for 
any type of mutation. We also assume that the per-protein-per-generation mutation rate ft is 
small, so that at each generation a protein undergoes at most one mutation. Our calculations 
at first follow, and then extend the theoretical treatment by |van Nimwegen etal. (1999j ) of 



evolution on a neutral network. In particular, we follow their lead in separately treating the 
two limiting cases where the product of the population size and mutation rate is 1 and 
3> 1. We emphasize that all of the equations derived in the next two sections depend only on 
the mutation rate fi, the number of generations T, and the matrix W which can be computed 
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from the single-mutant A AG values. The population size N determines the applicable limiting 
case, but otherwise drops out of all final results. 

Limit when N/i <C 1 When Nfx <C 1, the evolving population is usually clonal, since each 
mutation is either lost or goes to fixation before the next mutation occurs. If a mutation desta- 
bilizes a protein in the population beyond the stability cutoff, then it is immediately culled 
by natural selection. If a mutation does not destabilize a protein beyond the stability cut- 
off, it will be lost to genetic drift with probability ^fe^ and go to fixation with probability 
1/N (IKlMURA 1983b . Since mutations occur rarely (Nfj, <C 1), the loss or fixation of the 
mutant will occur before the next mutant appears in the population. The entire population 
therefore moves as one entity along its neutral network. The population can thus be described 
by the column vector p (i), with element (t) giving the probability that the population is in 
stability bin i at time t. 

If the population is initially in stability bin j, at each generation there is a probability 
NfiWij that a protein experiences a mutation that changes its stability to bin i, and if such a 
mutation occurs, then there is a probability of 1/iV that it is eventually fixed in the population. 
Therefore, at each generation there is a probability fjWij that the population experiences a 
mutation that eventually causes it to move from stability bin j to bin i. If we define the matrix 
V so that the diagonal elements are given by Vu = z/j and all other elements are zero, then p 
evolves according to 

p(t+l) = (I-/iV + /iW)p(t) (1) 
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where I is the identity matrix. Note that this equation treats lethal mutations (those that desta- 
bilize a protein beyond the cutoff) as immediately being lost to natural selection and so leaving 
the population in its original stability bin (hence the population accumulates a mutation with 
probability /xV rather than probability /x). Equation[T]describes a Markov process with the non- 
negative, irreducible, and acyclic transition matrix A = I — /xV + /xW, and so p approaches 
the unique stationary distribution p Q satisfying 

0=(V-W)p o . (2) 

This equation gives the expected distribution of protein stabilities solely in terms of the single- 
mutant A AG values. 

We now calculate the average number of mutations (m) T o that accumulate in an equili- 
brated population after T generations and the corresponding index of dispersion Rt,o- We em- 
phasize that (m) t , represents the average number of accumulated mutations during the course 
of the evolutionary process. When the number of accumulated mutations m is small compared 
to the length of the protein sequence L (m <C L), then m is just equal to the number of residues 
differing from those in the parent protein sequence (the Hamming distance). However, when 
m becomes substantial relative to L, m becomes larger than the Hamming distance since some 
sites will undergo multiple mutations (1JUKES and CANTOR 19691) . In this case it is necessary 
to use a substitution model to infer m from the observed Hamming distance. In the treatment 
that follows, we calculate the expected value of m; application of these formulae to actual pro- 
tein sequences requires use of one of the well-established statistical techniques for inferring 
m from the Hamming distance (IJUKES and CANTOR 19691 IGOLDMAN and YANG 19941) . We 
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begin the calculation of {m)T,o by defining p (m, t) to be the column vector with element i giv- 
ing the probability that at time t the population has accumulated m mutations and is in stability 
bin i. The time evolution of p (m, t)is given by 

p (m, t + 1) = (I - fiV) p (m, t) + /iWp (m - 1, t). (3) 

The fcth moment of the number of mutations at time t is 

c )i = e^m fc p(m,t), (4) 



where e = (1, . . . , 1) is the unit row vector. We can write a recursive equation for (m) t in the 
long-time limit (steady state) by multiplying both sides of Equation [3] by m, summing over m, 
and left multiplying by e to obtain 



(m 



)t+i = e (I — /SV) mp (m, t) + fteW mp (m — 1, t) 

m m 

= eA mp (m, t) + /ieWp Q 

= (m) t + n(u) , (5) 
where we have used the property eA = e, noted that in the long-time limit P ( m , t) = p 

m 

and m P ( m — 1? t) = Yl [( m — 1) P (m — l,t) + p (m — l,t)] = Yl m P ( m ; + Po, and 

mm m 

defined the average neutrality as (v) = eWp Q = eVp Q . Summing the recursion yields the 
steady- state value for the number of accumulated mutations, 

(m) T ,o = Tfj,{u) . (6) 
To calculate the index of dispersion R To = ^ m ) J<° ( m ) T <° ; we neec i to find the second 

{ m /T,o 

moment (m 2 ) T o . In a fashion analogous to the construction of Equation [5J we can write a 
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recursive expression for the long-time limit of (iti 2 )t,o as 



(m 2 ) t+ i = e (I — /iV) m 2 p (m, t) + /ieW m 2 p (m — 1, t) 

m m 

= eA ^ m 2 p (m, t) + 2/ieW ^ mp (m, t) + jueWp 

in 

A mp (m, t — 1) + /iWp 



(m 2 ) t + 2/ieW 



t-i 



(m 2 >, + 2/i 2 eW £ A T W Po + 



(7) 



r=0 



where we have used the property (implicit in EquationO) that in the long-time limit, m P ( m ; t) 

m 

A m P ( m , t — 1) + /iWp Q . Summing the recursion yields the following value for the long- 

m 

time limit, 

T-l t-1 



(m 2 ) T , = T /1 ( I /) + 2/i 2 eW^^A T W Po 

t=o T=0 
T 

= 7>(j/) + 2^ 2 eW ^(T-t) A'~ 1 Wp 

T 

= Tfi{u) + T (T — 1) /i 2 (z/) D 2 + 2^ 2 eW ^ (T - t) (A*" 1 - Q) W Po , (8) 



t=i 



where we have made the substitution eWQWp = (v)o and noted that lim^oo A* = Q = 
(p , . . . , p ) since A is an irreducible, aperiodic, stochastic matrix (IE WENS and GRANT 20051 ). 



This yields a value for the index of dispersion in the long-time limit of 

r t ,o = i - »W)o + (^ eW E ( x - f) ( A ^ - Q) w p°- 



(9) 



The above equation is consistent with the generic equation for the index of dispersion given 
by |Cutler (2000a| ) and |Cutler (2000bj ), where we now give concrete expressions for the 
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variables p and h (t) in Cutler's formula in terms of measureable quantitites, namely p = p{u) 
and h (t) = ^ eWA* -1 Wp . 

We can further simplify Equation|9]by performing spectral decompositions of A and Q. Let 
Ai, . . . , \k be the eigenvalues of V— W, and let r 1; . . . , r K and li, . . . , \ K be the corresponding 
right and left eigenvectors, normalized so that 1^ = 1 if % — j and otherwise. These eigen- 
vectors are also eigenvectors of the irreducible, acyclic, stochastic A, and the corresponding 
eigenvalues are 1 — p\\, . . . , 1 — p\r, with Perron-Frobenius theorems guaranteeing that one 
eigenvalue (chosen here to be 1 — ^Ai) is equal to one and all other eigenvalues have absolute 
values less than one. Then r x and l x are right and left eigenvectors of Q with eigenvalue 1 (i.e. 
ri = p and li = e), and all other eigenvalues of Q are zero. The spectral decompositions are 

K 

therefore Q = rili and A = rili + ^ (1 — p\i) rjlj. Inserting these spectral decompositions 

i=2 

into Equation HI we find for the index of dispersion a value of 

R T , = l-^(^) + -^-eW^('l-^W(l- /U A J )*" 1 ra i Wp , (10) 

\ '° t=l ^ ' i=2 

K 

since A* = r x li + (1 _ Mi)* T ih (IEWENS and GRANT 2005[) . In the limit of large T and 

i=2 

small ii, the value of Rt,o given by the above equation approaches the value 

2 T K 

Rt,o « l + -^eW^^(l- /U A i ) t - 1 r i l l W Po 

t=l i=2 

2 K 

« 1 + eW^A^Wpo, (11) 

W° i=2 

T 

where the [i(is) term drops out because p is small and the ^ (1 — Mi) ~ term drops out 
because T is large and |1 — p\\ < 1. This equation shows that Rt approaches a constant 
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value independent of T and /i. Although we could not prove that value of R T D given by Equa- 
tion[TT]is necessarily greater than one (since some of the eigenvalues Aj could be complex), in 
all of our simulations we observed R T:Q > 1, suggesting that when NpL -C 1, fluctuations in 
protein stability tend to overdisperse the molecular clock. 

Limit when Nji ^> 1 When N ji 3> 1, the population is spread across many nodes of the 
neutral network rather than converged on a single sequence dVAN NlMWEGEN et al. 19991) . 
In this limit, we treat the evolutionary dynamics of the population deterministically (i.e., we 
assume an infinite population size), and describe the distribution of stabilities in the population 
by the column vector x (t), with element Xi (t) giving the fraction of proteins in the population 
at time t that have stabilities in bin i. At generation t, the fraction of mutated proteins that 
continue to fold is (v) t = eWx (t). These folded proteins reproduce, and in order to maintain 
a constant population size, this reproduction must balance the removal of proteins by death, 
meaning that each folded sequence must produce an average of a t = [1 — \i (1 — (^)t)]~ 
offspring. The population therefore evolves according to 



After the population has evolved for a sufficient period of time, x approaches an equilibrium 
distribution of x^. The corresponding equilibrium neutrality is iy)^ = eWx^, and the 
equilibrium reproduction rate is a = [1 — \i (1 — (v)oo)]~ ■> so 



x (t + 1) = ott [(1 - n) I + /iW] x it) . 



(12) 




(13) 
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This equation can be rewritten to show that is the principal eigenvector of W, 



■oo 



Wx 



(14) 



We note that (u)^ approximates the asymptotic neutrality for the decline in the fraction of 
folded proteins upon random mutagenesis dBLOOM et al. 20051 IWlLKE et al. 20031) . 

We now determine the average number of accumulated mutations (m) T oo and the corre- 
sponding index of dispersion -Rt.oo by treating the forward evolutionary process. As described 
in the text immediately prior to Equation [3l our calculations describe the actual number of 
mutations accumulated during the evolutionary process, which may differ from the number 
of sequence differences relative to the ancestor if a single site undergoes multiple mutations. 
When N/j, ^> 1, it is not a priori obvious that the average number of mutations present in the 
population is equivalent to number of fixed substitutions along the line of descent. Therefore, 
in the APPENDIX, we show that identical results are obtained by tracing a randomly chosen 
protein backwards in time along its ancestor distribution, proving the treatment we give below 
is mathematically equivalent to treating the time-reversed process. We define x (m, t) as the 
column vector with element i giving the fraction of the population at time t that has accumu- 
lated m mutations and is in stability bin i. Once the population has reached the equilibrium 
distribution of stabilities, the time evolution of x (m, t) is 



x (m, t + 1) 



a (1 — /jl) x (m, t) + a/iWx (m — l,t). 



(15) 



The recursion can be solved to obtain 




(16) 
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as can be verified by direct substitution. Since we are assuming the population has equilibrated 
at time and no mutations have accumulated at that time, x (m, 0) is for m = and 
otherwise. Furthermore, Xoo satisfies Equation[[H so multiplying Equation [16] by e yields 

x (m, t) = ( t ) a* (1 - fif- m (^>oo) m , (IV) 
\m J 

where x (m, t) = ex (m, t) gives the fraction of the population that has accumulated m muta- 
tions after t generations. The average number of accumulated mutations after T generations is 
the mean of this binomial distribution, 

(m) T ,oo = 7z t^t- (18) 

Using the well known result for the variance of the binomial distribution, we find that the index 
of dispersion is 

Rt.oo = 1- 1 T^W-y d9) 

1 — A* (1 — Woo) 

It is important to reiterate that the above equation was derived under the assumption that there 
is at most one mutation per sequence per generation. For realistic distributions of mutations 
(i.e. Poisson), this means that fi «C 1. In this regime, Rt,oo is close to one. 

Lattice Protein Simulations We tested our theory's predictions on the evolutionary dynamics 
of lattice proteins. Lattice proteins are simple protein models that are useful tools for study- 
ing protein folding and evolution (ICHAN and BORNB ERG-BAUER 20021 ). Our lattice proteins 
were chains of 20 amino acids that folded on a two-dimensional lattice. The energy of a lattice 
protein conformation was equal to the sum of the pairwise interactions between non-bonded 

24 



amino acids (IMlYAZAWA and JERNIGAN 1983T ). Each lattice protein has 41,889,578 possible 
conformations, and by summing over all of these conformations we could exactly determine 
the partition sum and calculate AG/. We set a minimal stability threshold for the lattice pro- 
teins of AG" 1111 = 0, meaning that we considered all proteins that folded to the target structure 
with AG/ < to be folded and functional, while all proteins with AG/ > were considered 
to be nonfunctional. We note that this stability threshold is equivalent to requiring a lattice 
protein to spend at least half of its time in the target native structure at equilibrium. We began 
by generating lattice proteins that stably folded to each of the three different structures shown 
in Figure [21 For each of these three proteins, we determined the distribution of A AG values 
for all 380 single mutations (these distributions are shown in Figure [2]). These distributions 
were used to construct the matrix W and to predict the equilibrium distribution of stabilities, 
the average number of mutations, and the indices of dispersion for both the N fi <C 1 and the 
NpL 3> 1 cases, using the equations presented in the preceding sections. 

To test the accuracy of these predictions, we then simulated evolving populations of the 
lattice proteins with a standard evolutionary algorithm using Wright-Fisher sampling. Briefly, 
the populations were held at a constant size of either N = 10 or N = 10 5 . At each generation, a 
new population was created by choosing parents with equal probability from all folded proteins 
in the previous generation's population, and copying these parents into the new population 
with a mutation rate of 5 x 1CT 4 mutations per residue per generation. Since the proteins 
have a length of 20 amino acids, this mutation rate corresponds to a per-protein-per-generation 
mutation rate of ji = 10~ 2 . Therefore, the product Nfi is either 0.1 or 10 3 , corresponding 
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to Nfi C 1 or N/j, ^> 1, respectively. We emphasize that the lattice protein evolutionary 
algorithm is the same for both population sizes. When N = 10 the population naturally follows 
dynamics approximating those presented for N^i <C 1, while when N = 10 5 it naturally 
follows dynamics approximating those presented for iV/i 3> 1 (as evidenced by the excellent 
agreement of the predictions with the simulations). For N = 10, we performed 1,000 replicates 
for each different structure. For N = 10 5 , computational constraints limited us to 10 replicates 
for each structure (however the evolutionary dynamics are nearly deterministic in this case, 
so all replicates yielded similar results). We note that during the simulations we recorded the 
number of mutations that actually accumulated rather than simply computing the number of 
differences (Hamming distance) from the original sequence. 

Figure [2] shows the theoretical predictions and simulation results for each of the three struc- 
tures. The theoretical predictions are in good agreement with the simulation results. Figure [2] 
clearly shows that when Nfi ^> 1, the proteins tend to be more stable than when iV/i <C 1. 
This extra stability is a biophysical manifestation of the neutrally evolved mutational robust- 
ness predicted by | VAN NlMWEGEN et al. (1999| ). This increase in stability leads to a substan- 



tial increase number of accumulated mutations. In accordance with the theoretical predictions, 
when Nfi <C 1 the index of dispersion is elevated above one by fluctuations in protein stability. 
Another clear results from the simulations is that proteins of different structure show markedly 
different distributions of stabilities and rates of sequence evolution due to the differences in 
their AAG distributions. Overall, the simulations offer strong support for the validity of the 
theoretical predictions in the preceding sections. 
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[Figure 2 about here.] 



DISCUSSION 

We have presented that a theory that offers quantitative predictions about the distribution of 
stabilities, the average number of fixed mutations, and the index of dispersion for an evolving 
protein population in terms of the AAG values for individual mutations. We have demon- 
strated that these predictions are accurate for simple lattice proteins, and have used existing 
biophysical evidence to argue that the basic theoretical assumptions should also be accurate for 
real proteins. In this section, we give qualitative interpretations of the mathematical results and 
discuss their implications for our understanding of protein evolution. 

One major result is to show that the effects of protein structure on the rate of sequence 
evolution can be quantitatively cast in terms of the AAG values for single mutations. Nu- 
merous lattice protein simulations have shown that protein structure can dramatically affect 
the rate of sequence evolution, since structures that are more "designable" (encoded by more 
sequences) can evolve their sequences more rapidly (as can be seen in Fig. [2] of this work) 
(IGovindarajan and Goldstein 1997[[Bornberg-Bauer and Chan 1999UTiana etal. 20001 
IXia and Levitt 2004ULI et al. 1996HChan and Bornberg-Bauer 20021 IWingreen et al. 20041) . 
Unfortunately, these simulations typically measure structural designability by enumerating a 
large number of lattice protein sequences, meaning that their findings cannot be extended to 
real proteins for which such extensive enumeration is impossible. However, recent theoretical 
work by [England and Shakhnovich (2003] ) has made progress in connecting designabil- 



ity to observable structural properties, and a bioinformatics analysis based on this theoretical 
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measure of designability indicates that structure indeed influences the evolutionary rate of real 
proteins (IB LOOM et al. 2006b1 ). Our work provides a way to quantitatively relate the struc- 
tural influences on protein evolution to experimentally measureable A AG values, opening the 
door to further connecting structural designability and sequence evolution to laboratory stabil- 
ity measurements. Although thousands of AAG values have been measured experimentally 
(|KUMAR et al. 20061) . at present there are no large sets of measurements for truly random mu- 
tations to a single protein. When such sets of measurements become available, it should be 
possible to use them in conjunction with the theory that we have presented to predict the neu- 
tralities of real proteins with different structures. 

A second important result is to show that protein evolutionary dynamics can depend on the 
product of population size and mutation rate, Nix. When N/x 3> 1, the evolving protein pop- 
ulation is polymorphic in stability and subject to frequent mutations, so the more stable (and 
thus more mutationally tolerant) proteins produce more folded offspring. In contrast, when 
N/x <C 1, the population is usually monomorphic in stability and so all members of the popu- 
lation are equally likely to produce folded offspring. The general tendency for populations to 
neutrally evolve mutational robustness when N/x ^> 1 has previously been treated mathemati- 



cally by |VAN NlMWEGEN et al. (1999[ ), and a variety of lattice protein simulations have noted 
the tendency of evolving protein populations to preferentially occupy highly connected neu- 
tral network nodes (IBornberg-Bauer and Chan 19991; ITaverna and Goldstein 2002bt 
IXia and Levitt 20041) . Our work shows that for proteins, in the limiting cases when N /i <C 1 
or 3> 1, this process can be rigorously described by considering only protein stability, rather 
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than requiring a full analysis of the neutral network (provided, as we have argued is likely to be 
the case, that the assumption of a roughly constant AAG distribution holds for real proteins as 
well as it holds for our lattice proteins). In addition, we prove that the number of accumulated 
mutations depends on whether N/j is 1 or ^> 1. This finding is at odds with the standard 
prediction (IKlMURA 19871 ) of Kimura's neutral theory that the rate of evolution is independent 
of population size. The reason for this discrepancy is that the standard neutral theory fails to ac- 
count for the possibility that increasing the population size so that Nji ^> 1 can systematically 
increase the fraction of mutations that are neutral. 

A third important contribution of our theory is to use the distribution of A AG values for sin- 
gle mutations to predict the distribution of protein stabilities in an evolving population. Several 
researchers have pointed out that evolved proteins will be marginally stable simply because 
most mutations are destabilizing (|TAVERNA and GOLDSTEIN 200211 I ARNOLD et al. 20011) : 
we have described this process quantitatively. In addition, we have shown how the neutral 
evolution of mutational robustness when Nji ^> 1 will shift the proteins towards higher stabil- 
ities (as shown in Fig. [2]), although this increase in stability is limited by the counterbalancing 
pressure of predominantly destabilizing mutations. The formulae we provide can in princi- 
ple be combined with experimentally measured AAG values to predict the expected range of 
stabilities for evolved proteins. 

Our work also weds Takahata's concept that fluctuating neutral spaces might overdisperse 
the molecular clock (ITakahata 19871 ICutler 2000bi IBastolla et al. 20021) to a concrete 
discription of how protein neutrality fluctuates during evolution. When Nfi <C 1, fluctuations 
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in protein stability can cause an overdispersion in the number of accumulated substitutions 
that can be calculated from the single-mutant AAG distribution. Furthermore, given our as- 
sumption of a roughly constant A AG distribution, we show that the index of dispersion will ap- 
proach a constant value that is independent of time or mutation rate, but will depend on whether 
N n <C 1 or ^> 1. Previous simulations have indicated that overdispersion indeed depends on 
the population size (IBASTOLLA et al. 20021 IWlLKE 20041) — we have explained this depen- 
dence by showing that stability-induced overdispersion does not occur when Nfj, ^> 1 since 
the population's distribution of stabilities equilibrates as it spreads across many sequences. 
Mathematically, the difference in the cases N [i 3> 1 and N fi <C 1 is that, assuming the AAG 
distribution remains relatively constant, when the population size is sufficiently large, the dis- 
tribution of protein stabilities no longer fluctuates in a manner that influences the probability 
of a substitution (Equation [3] contains /iV in the first term on the right side, while Equation fT?l 
does not). 

In summary, we have presented a mathematical theory of how thermodynamics shape neu- 
tral protein evolution. A major strength of our theory is that it makes quantitative predictions 
using single-mutant AAG values, which can be experimentally measured. Our work also 
suggests how neutral and adaptive protein evolution may be coupled through protein thermo- 
dynamics. Protein stability represents an important hidden dimension in the evolution of new 
protein function, since extra stability that is itself neutral can allow a protein to tolerate mu- 
tations that confer new or improved functions dBLOOM et al. 2006T) . Our theory describes the 
dynamics of protein stability during neutral evolution — adaptive protein evolution is super- 
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imposed on these stability dynamics, with proteins most likely to acquire beneficial mutations 
when they are most stable. 
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APPENDIX 

Here we calculate the properties of the evolving population when Nji ^> 1 by analyzing the 
time-reversed process to compute the mean and variation in the number of mutations in a single 
randomly chosen protein over time. We show that the results so obtained are identical to those 
found in the main text, where we analyzed the forward-time process to compute the mean and 
variation in the number of mutations across the population of evolving proteins. 

When Nji ^> 1, the population is now never converged to a single sequence, so it is not a 
priori obvious that the average number of mutations present in the population is equivalent to 
the expected number of fixed substitutions along the line of descent. In fact, in the limit of very 
large population sizes there may not even be a common line of descent in relevant time frames, 
since many new mutations will occur before any given mutation goes to fixation. In the main 
text we calculated the average number of mutations (m)T,oc a sequence in the population has 
accumulated over the last T generations by treating the forward evolution of the population. 
Here we trace a randomly chosen protein in the population back in time, and show that the 
average number of substitutions (s) T that it has accumulated over the last T generations is 
equal to (m) T oo . We also show that indices of dispersion of (m) T oo and (s) T have the same 
value of -Rt,oo- 

To calculate (s) T , we first define a vector a giving the ancestor distribution (1HERMISSON et al. 20021) 
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element i of a (T — t) gives the probability that a randomly chosen sequence from the popula- 
tion at time T had a predecessor with stability in bin i at time T — t. The transition probabilities 
of a (T — t) when the population is in equilibrium are the discrete time analogue of those com- 
puted by |HERMISS0N etal. (2002| ). From Equation [T5] of the main text, it follows that the 



fraction of sequences in bin % at time t + 1 that had as their ancestor in the previous generation 
a sequence in bin j is a t [(1 — fi) 5^ + fjWij] Xj (t). In order to obtain the probability that a 
sequence in bin i at time t + 1 had an ancestor in bin j, we have to divide this fraction by the 
total number of sequences in bin i at time t + 1. When the population is at equilibrium, a t = a 
and Xi (t + 1) = Xi (t) = Xi where Xi is the element from x^. Hence, the probability that a 
sequence in bin i had an ancestor in bin j is a [(1 — fi) §ij + [iHj], where we have defined 

Hji = WijXj/xi, (20) 

The time evolution of a is therefore 

a (T - t) = a [(1 - fi) I + fiH] a (T - t + 1) , (21) 

where the matrix H is defined by Equation [201 Equation [2T] can be solved to show that the 
equilibrium value of a is a^ satisfying 

(z/) 00 a 00 = Haoo. (22) 

If we define a (s, T — t) as the vector with element i giving the probability that a randomly 
chosen sequence at time T had a predecessor at time T — tin stability bin i and with s substitu- 
tions relative to the sequence at time T, then the time evolution for an equilibrated population 
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IS 



a(s, T - t - 1) = a (1 - /i) a(s, T - t) + a/iHa (s — 1,T — t) . 



(23) 



We can solve Equations |23] and l22l in a manner analogous to the forward process to obtain 



Again defining a(s,T — t) = ea (s, T — t) as the probability of having accumulated s substi- 
tutions as one moves back t generations from time T, we obtain the binomial distribution 



Comparison of Equation [FT] of the main text and Equation [25] shows that they are identical. 
Therefore, all moments computed from the two distributions must be equal. In particular, this 
proves that (m)T, oo = ( s )t, and that the corresponding indices of dispersion have the same 
value of Rt,oo defined by Equation [19] of the main text. This shows that when Nfi 3> 1, we 
expect equivalent results regardless of whether we average over the number of mutations in 
all sequences present in the population, or randomly choose a single sequence and trace back 
along its ancestor distribution. 




(24) 




(25) 
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List of Figures 

A thermodynamic view of protein evolution. A mutant protein stably folds 
if and only if it possesses some minimal stability, AG™ 111 (in this case -5 
kcal/mol). The stability of the wildtype protein is AGPp = —7.5 kcal/mol, 
meaning that it has AGj xtra = —2.5 kcal/mol of extra stability. The bars show 
the distribution of A AG values for mutations. Those mutants with AGj xtra + 
A AG < still stably fold, while all other mutants do not fold and so are culled 
by natural selection. The probability that a mutation will be neutral with re- 
spect to stable folding is simply the fraction of the distribution that lies to the 

left of the threshold. The data in this figure are hypothetical 

The theory gives accurate predictions for the evolution of model lattice pro- 
teins. Each row of panels corresponds to a different lattice protein. The graphs 
at left show the starting protein and the distribution of AAG values for all point 
mutations. The graphs in the middle and right show the predicted (lines) and 
measured (boxes) distributions of stabilities among the evolved proteins. The 
tables embedded in the graphs show the predicted and measured values for the 
average number of mutations ((m) T ) and the index of dispersion (Rt) after 
5,000 generations of neutral evolution. The center graphs are for a population 
size of N = 10, and the graphs at the right are for N = 10 5 . In both cases, the 
per protein per generation mutation rate is /i = 0.01. As predicted, the evolving 
population with Nji ^> 1 evolved mutational robustness that is manifested by 
increased protein stability. This additional mutational robustness accelerated 
the rate of sequence evolution 
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Figure 1 : A thermodynamic view of protein evolution. A mutant protein stably folds if and 
only if it possesses some minimal stability, AG™ 111 (in this case -5 kcal/mol). The stability of 
the wildtype protein is AGJ 1 = —7.5 kcal/mol, meaning that it has AGy xtra = —2.5 kcal/mol 
of extra stability. The bars show the distribution of A AG values for mutations. Those mutants 
with AGj xtra + A AG < still stably fold, while all other mutants do not fold and so are culled 
by natural selection. The probability that a mutation will be neutral with respect to stable 
folding is simply the fraction of the distribution that lies to the left of the threshold. The data 
in this figure are hypothetical. 
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Figure 2: The theory gives accurate predictions for the evolution of model lattice proteins. 
Each row of panels corresponds to a different lattice protein. The graphs at left show the start- 
ing protein and the distribution of AAG values for all point mutations. The graphs in the 
middle and right show the predicted (lines) and measured (boxes) distributions of stabilities 
among the evolved proteins. The tables embedded in the graphs show the predicted and mea- 
sured values for the average number of mutations ((m)y) and the index of dispersion (Rt) after 
5,000 generations of neutral evolution. The center graphs are for a population size of N = 10, 
and the graphs at the right are for N = 10 5 . In both cases, the per protein per generation 
mutation rate is fj, — 0.01. As predicted, the evolving population with Nji ^> 1 evolved mu- 
tational robustness that is manifested by increased protein stability. This additional mutational 
robustness accelerated the rate of sequence evolution. 



43 



0.2 



0.1 



-1 



K — H — R 

E— H— A— L F— E 
I 

H Y— L F— S 

I I 
K— F I T 

I 

m KttQ 



0.3 



3 0.2 
1 0.1 



N/i < 1 


(m) T 


Rt 




predicted 


9.7 


1.49 


1 


- measured 


9.4 


1.43 




-2 


-1 








AAG 



AG/ 




